Introduction to Data and Statistics

Statistics appear throughout everyday life, in reporting on topics from crime and sports to education and real estate. Whether reading a newspaper, watching TV, or browsing the internet, we often encounter statistical information based on samples. Knowing how that information was produced helps us assess the accuracy of claims and make better-informed decisions. Understanding statistical methods is essential for analyzing information thoughtfully, whether buying a house, managing a budget, or working in fields like economics, business, psychology, biology, or law. This first module introduces the fundamentals of statistics: the mindset of a statistics student, the relevant definitions, how data is collected, and how to identify reliable data.

Terminology for Statistics

Before exploring what statistics involves, it is important to introduce a few key terms that will be used throughout our study of statistics.

What is Data?

Data is any collection of observations (such as measurements, genders, or survey responses).

Example : Data

  1. An instructor asks the students in their class how many hours they studied last week, and the students respond with times ranging from 30 minutes to 4 hours. All of these times are considered data.
  2. An online survey asks users to identify their favorite fast food restaurant. Two of the selections made are Zaxbys and McDonald's. These two selections are considered data.

What is Statistics?

Statistics is the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

Statistics operates by using studies, surveys, polls, and other data collection tools to gather information from a subset of a larger group, allowing insights about the entire group. This principle reflects the core purpose of statistics and the objective of this course: understanding some larger group by analyzing data from a representative subset.

To further understand this core principle, let's cover a few more key terms that will be used during this course as we progress towards our objective.

The Larger Group

A population is the complete collection of all individuals (scores, people, measurements, and so on) to be studied. The collection is complete in the sense that it includes all of the individuals to be studied.

Individuals often refer to people, but not always. For example, individuals can also refer to animals in a wildlife study, products in a quality control inspection, plants in a botanical experiment, or even stars in an astronomical survey.

If you do collect data from every individual, you have conducted a census, which is a collection of data from every member of the population.

Consider the example of products in a quality control inspection. As the manufacturer, you want all your products to be free of defects. You have two options: test every product or test a small selection and use the results to evaluate the entire product line. Most manufacturers choose the latter because testing every product is more expensive and may damage or destroy the items being tested.

This example highlights the difference between a census, which involves testing every product, and a sample, which involves testing only a subset.
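The quality control scenario can be mimicked with a short simulation. The following is only a sketch under made-up assumptions: the production run of 10,000 items, the 2% defect rate, the sample size of 500, and the random seed are all hypothetical values chosen for illustration, not figures from a real manufacturer.

```python
import random

# Hypothetical production run: 10,000 items, each marked True if defective.
# The 2% defect rate and the seed are illustrative assumptions.
rng = random.Random(42)
population = [rng.random() < 0.02 for _ in range(10_000)]

# Census: inspect every item in the population.
census_rate = sum(population) / len(population)

# Sample: inspect only 500 randomly chosen items.
sample = rng.sample(population, 500)
sample_rate = sum(sample) / len(sample)

print(f"Census defect rate: {census_rate:.3f} (inspected {len(population)} items)")
print(f"Sample defect rate: {sample_rate:.3f} (inspected {len(sample)} items)")
```

The two rates will usually be close but not identical. The sample reaches its estimate after inspecting far fewer items, which is the trade-off described above.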

The Smaller Group

A sample is a subset (or smaller group) of members selected from a population.

Some examples include:

  • Political polls study a randomly selected group of households across various states to estimate trends for the entire U.S. population. This approach is more efficient and cost-effective than conducting a full census, as political organizations and journalists lack the resources or time for a true census.
  • A professor surveys only 5 of their 30 students about their sleep habits, rather than collecting data from every student.
  • The top 8 scores from a competition are sampled to assess whether the overall skill level of the athletes is improving.

Example : Populations vs Samples

Say that a statistics professor is studying the amount of sleep his students get each night and is only able to collect data from ten of his thirty students. What is the sample and what is the population in this scenario?

Solution

The sample in this example is the sleep data from the ten students the professor was able to survey.

The population that the professor intends to study is all thirty of his students.

\[ \tag*{\(\blacksquare\)} \]

Example : Populations vs Samples Part 2

Determine the population and sample for the given situation: A Gallup poll is given to all eligible voters, and there are 2.3 million responses.

Solution

The sample in this scenario is the 2.3 million respondents/responses, and the population is all adults eligible to vote.

\[ \tag*{\(\blacksquare\)} \]

Example : Populations vs Samples Part 3

Determine whether the following data is from a population or a sample:

  • Part A: The age of every fourth person entering a grocery store.
  • Part B: The major for each student at a community college.

Solution

  • Part A: This data is from a sample because we are collecting age information for only some of the customers, not all of them.
  • Part B: This data is from a population because the community college has information about all students.

\[ \tag*{\(\blacksquare\)} \]
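Part A of the example above describes picking every fourth person, which can be sketched in a few lines of code. The list of ages below is hypothetical and stands in for customers in the order they entered the store.

```python
# Hypothetical ages of 12 customers, in the order they entered the store.
ages = [34, 22, 57, 41, 19, 63, 28, 45, 37, 52, 30, 71]

# Every fourth person: the 4th, 8th, and 12th customers (indexes 3, 7, 11).
sample = ages[3::4]

print(sample)  # [41, 45, 71]
```

Only 3 of the 12 ages end up in `sample`, which is why the data in Part A comes from a sample rather than from the whole population of customers.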

Often, the difference between a population and a sample is not obvious. Be careful with the following examples, which illustrate common errors in distinguishing a population from a sample.

Example : Populations vs Samples Part 4

Identify the population and sample for this situation: A company surveys 850 of its employees and finds that 520 are satisfied with their job.

Solution

Many people assume the 850 employees are the population because 850 is a large number, but this is incorrect. A population includes all members of the group being studied, while a specific count of individuals surveyed, like the 850 here, generally describes a sample.

In this case:

  • The 850 employees surveyed represent a sample of the population of all employees at the company.
  • The 520 is a statistic calculated from the sample, not the sample itself.

\[ \tag*{\(\blacksquare\)} \]
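As a quick check on the numbers in Part 4, the proportion of surveyed employees who reported being satisfied can be worked out from the two counts given in the example:

\[ \frac{520}{850} \approx 0.612 \]

That is, about 61.2% of the sample. This value describes the 850 employees surveyed, not necessarily all employees at the company.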

Example : Populations vs Samples Part 5

An ecologist wants to study the nesting habits of birds in a particular forest. They identify 1800 trees in the northwest region of the forest and randomly select 300 trees to observe. Of those, 120 trees contain nests.

  • Part A: What is the population the ecologist wants to study?
  • Part B: What is the sample they obtained?
  • Part C: About which population can the ecologist draw conclusions?

Solution

  • Part A: The ecologist wants to study all trees in the forest where birds might build nests. The figure of 1800 trees is misleading because it covers only one part of the forest.
  • Part B: The sample is the 300 trees the ecologist selected to observe, because a sample includes only the individuals from which data is collected.
  • Part C: Since the ecologist only observed trees in the northwest region, they can only make inferences about the population of trees in the northwest region, not the entire forest.

\[ \tag*{\(\blacksquare\)} \]
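The selection step in Part 5 can be sketched as follows. The tree ID numbers and the random seed are hypothetical. Only the counts (1800 trees identified, 300 observed, 120 with nests) come from the example.

```python
import random

# Hypothetical ID numbers for the 1800 trees identified in the northwest region.
northwest_trees = list(range(1, 1801))

# Randomly select 300 distinct trees to observe (seed fixed for repeatability).
rng = random.Random(0)
observed = rng.sample(northwest_trees, 300)

# From the example: 120 of the 300 observed trees contained nests.
trees_with_nests = 120
nest_proportion = trees_with_nests / len(observed)

print(len(observed))    # 300
print(nest_proportion)  # 0.4
```

Every tree in `observed` was drawn from `northwest_trees`, so the proportion of 0.4 can only speak for trees in the northwest region, which matches the answer to Part C.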

Example : Populations vs Samples Part 6

We want to study the heights of students at Columbia State Community College, so we set up a table in the student resources building to collect data from our fellow students as they come into the building. This college has an enrollment of 2000 students. We obtain the heights of 281 students over the course of three days. For this scenario, identify both the population and the sample, as well as some factors preventing us from obtaining a census.

Solution

The population is all 2000 students enrolled in the college, since ideally we would collect height data from every one of them.

The sample is the 281 student heights we collected over the three days.

Collecting data in this manner prevents us from reaching students who do not visit the student resources building, which is one factor keeping us from obtaining a census. Because of this, we probably do not have a representative sample, so this sample would not be the best for studying the heights of all students at Columbia State Community College. The data could instead be used to make inferences about students who use the student resources building.

\[ \tag*{\(\blacksquare\)} \]

Conclusion

Understanding the distinction between populations and samples is fundamental to statistical analysis. Populations represent the entire group under study, while samples are subsets used to make inferences about the population. Although a census provides comprehensive data, it is often impractical for large populations, making sampling a more efficient and feasible approach. Recognizing these differences and correctly identifying populations and samples in various contexts ensures accurate interpretations and better decision-making in studies.